Fitting and evaluating logistic regression models with lrm
The Framingham example
Outcome: chd10 = Developed coronary heart disease in next 10 years?
Use lrm to model chd10 using four predictors
on the complete cases (fram_cc)
accounting for missingness via single imputation
accounting for missingness via multiple imputation
Consider adding non-linear terms, refit and re-evaluate
Today’s R Setup
knitr::opts_chunk$set(comment = NA)

library(cutpointr)  ## NEW: specifies "optimal" cutpoints
library(caret)      ## for creating a confusion matrix
library(pROC)       ## should come after cutpointr
library(ROCR)       ## NEW: alternative to pROC for plotting ROC curves
library(janitor)
library(broom)
library(naniar)
library(mice)       ## we'll use for single imputation today
library(rms)        ## also (automatically) loads Hmisc
library(tidyverse)

theme_set(theme_bw())
While the only four predictors we’ll use today for chd10 are educ, glucose, sbp and smoker, we’ll impute all of the missing values, using the complete set of 17 variables.
Imputation via mice
We need to impute:
5 quantities (glucose, bmi, cigs, chol and hrate)
1 binary variable (bp_meds), and
1 multi-categorical variable (educ)
We have missing data in 13.7% of our observations, so we’ll use mice to create 15 imputed data sets, and then save one of them as our “single imputation” tibble.
set.seed(432432)
fram_mice15 <- mice(fram_orig, m = 15, printFlag = FALSE)
fram_start contains 4238 rows and the 6 columns we’ll use, with 388 rows missing glucose and 105 missing educ.
fram_cc: (complete cases) includes only the 3753 complete rows for our 6 columns.
fram_si: singly imputed to yield 4238 rows on our 6 columns with complete data.
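One way to build the fram_si tibble (a sketch, assuming the fram_mice15 object from the mice() call above; the choice of imputation number 7 is arbitrary) is to pull a single completed data set out of the mice object with complete():

```r
## Sketch: extract one of the 15 completed data sets from the mice object
## to serve as our "single imputation" tibble (imputation 7 chosen arbitrarily)
fram_si <- complete(fram_mice15, 7) |> as_tibble()
```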
Modeling Plan
Use lrm to fit a four-predictor logistic regression model to predict chd10 using glucose, smoker, sbp and educ
Using the complete cases (fram_cc)
Accounting for missingness via single imputation (fram_si)
Accounting for missingness via multiple imputation, via aregImpute()
Then, we’ll consider adding several non-linear terms to the “four-predictor” models, and refit.
Fitting a Four-Predictor Model using Complete Cases
A “Four Predictor” model
First, we’ll use the fram_cc data to perform a complete-case analysis and fix ideas.
d <- datadist(fram_cc)
options(datadist = "d")

mod_cc <- lrm(chd10 ~ glucose + smoker + sbp + educ,
              data = fram_cc, x = TRUE, y = TRUE)
This works very nicely when chd10 = 1 (for Yes) or 0 (for No), as it does here. What if your outcome was actually a factor with values Yes and No? Use the following…
mod_cc <- lrm((outcome == "Yes") ~
glucose + smoker + sbp + educ,
data = fram_cc, x = TRUE, y = TRUE)
Main Output for mod_cc
mod_cc
Logistic Regression Model
lrm(formula = chd10 ~ glucose + smoker + sbp + educ, data = fram_cc,
x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 3753 LR chi2 223.29 R2 0.100 C 0.682
0 3174 d.f. 6 R2(6,3753) 0.056 Dxy 0.363
1 579 Pr(> chi2) <0.0001 R2(6,1469) 0.137 gamma 0.364
max |deriv| 2e-11 Brier 0.122 tau-a 0.095
Coef S.E. Wald Z Pr(>|Z|)
Intercept -5.5622 0.3217 -17.29 <0.0001
glucose 0.0081 0.0016 4.93 <0.0001
smoker 0.3126 0.0955 3.27 0.0011
sbp 0.0237 0.0020 12.05 <0.0001
educ=HS grad -0.4674 0.1157 -4.04 <0.0001
educ=Some Coll -0.3924 0.1423 -2.76 0.0058
educ=Coll grad -0.1356 0.1549 -0.88 0.3815
Deconstructing mod_cc summaries, 1
Logistic Regression Model
lrm(formula = chd10 ~ glucose + smoker + sbp + educ, data = fram_cc,
x = TRUE, y = TRUE)
Obs = 3753 0 = 3174 1 = 579 max |deriv| 2e-11
Obs = Observations used to fit model, with 0 = the # of zeros and 1 = the # of ones in our outcome, chd10.
max |deriv| is the maximum absolute value of the derivative at the point where the maximum likelihood function was estimated.
All we care about is whether the iterative function-fitting process converged, and R will warn you if it doesn’t.
Deconstructing mod_cc summaries, 2
Model Likelihood Ratio Test: LR chi2 = 223.29, d.f. = 6, Pr(> chi2) < 0.0001
This is a global likelihood ratio test (drop in deviance test).
Likelihood Ratio \(\chi^2\) = null deviance - residual deviance
The output also reports several discrimination indexes, including:
the Nagelkerke \(R^2\) and the Brier score,
the area under the ROC curve, or C statistic, shown as C, and
Somers’ d statistic, symbolized Dxy here.
Let’s walk through each of those, in turn.
Key Indexes (Nagelkerke \(R^2\))
The Nagelkerke \(R^2\) reaches 1 if the fitted model shows as much improvement as possible over the null model (which just predicts the mean response on the 0-1 scale for all subjects).
Nagelkerke \(R^2\) is 0 for the null model, and is larger (closer to 1) as the fitted model improves, although it’s criticized for being misleadingly high.
A Nagelkerke \(R^2\) value of 0.100 doesn’t mean 10% of anything.
Here, Nagelkerke \(R^2\) = 0.100 indicates fairly low quality of fit.
An Alternative: McFadden’s \(R^2\)
McFadden’s \(R^2\) = 1 minus the ratio of the model deviance to the null model’s deviance.
To obtain this for our mod_cc run with lrm, use:
1 - (mod_cc$deviance[2] / mod_cc$deviance[1])
[1] 0.069174
This McFadden \(R^2\) corresponds well to the proportionate reduction in error interpretation of an \(R^2\), if that’s all you need.
Key Indexes (Brier Score = 0.122)
The lower the Brier score, the better the predictions are calibrated.
The maximum (worst) score is 1, the best is 0.
From Wikipedia: Suppose you forecast the probability P that it will rain tomorrow.
If the forecast is P = 1 (100%) and it rains, the Brier Score is 0.
If the forecast is P = 1 (100%) and it doesn’t rain, the Brier Score is 1.
If the forecast is P = 0.7 and it rains, Brier = \((0.70 - 1)^2 = 0.09\).
If the forecast is P = 0.3 and it rains, Brier = \((0.30 - 1)^2 = 0.49\).
If the forecast is P = 0.5, the Brier score is \((0.50 - 1)^2 = 0.25\) regardless of whether it rains.
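The rain-forecast bullets above can be verified directly: the Brier score is just the mean squared difference between the forecast probabilities and the observed 0/1 outcomes. A minimal sketch with toy data (not the Framingham data):

```r
## Toy example: forecast probabilities and observed 0/1 outcomes,
## matching the five Wikipedia scenarios above
probs <- c(1, 1, 0.7, 0.3, 0.5)
obs   <- c(1, 0, 1,   1,   1)

## Per-forecast Brier contributions: 0, 1, 0.09, 0.49, 0.25
(probs - obs)^2

## The Brier score is the mean of those squared errors
brier <- mean((probs - obs)^2)  ## 0.366 here
```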
Is collinearity a problem?
rms::vif(mod_cc)
glucose smoker sbp educ=HS grad educ=Some Coll
1.011147 1.036391 1.047904 1.134669 1.106597
educ=Coll grad
1.100638
Receiver Operating Characteristic Curve Analysis
One way to assess the predictive accuracy within the model development sample in a logistic regression is to consider analyses based on the receiver operating characteristic (ROC) curve. ROC curves are commonly used in assessing diagnoses in medical settings, and in signal detection applications.
The accuracy of a test can be evaluated by considering two types of errors: false positives and false negatives.
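As a sketch (assuming the mod_cc fit and fram_cc data from above), the ROC curve and its area can be obtained with pROC, which we loaded in the setup; predict() on an lrm fit with type = "fitted" returns the fitted probabilities of chd10 = 1:

```r
## Sketch: build the ROC curve from observed outcomes and fitted probabilities
roc_cc <- pROC::roc(fram_cc$chd10, predict(mod_cc, type = "fitted"))

## The area under this curve should match the C statistic in the lrm output
pROC::auc(roc_cc)
```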
C = 0.682 (area under ROC curve)
The C statistic and Somers’ d (Dxy) are connected:
\[
C = 0.5 + \frac{d}{2}, \quad d = 2(C - 0.5)
\]
The C statistic ranges from 0 to 1.
C = 0.5 describes a prediction that is exactly as good as random guessing
C = 1 indicates a perfect prediction model, one that guesses “yes” for all patients with chd10 = 1 and which guesses “no” for all patients with chd10 = 0.
Most of the time, the closer to 1, the happier we are:
\(C \geq 0.8\) usually indicates a moderately strong model (good discrimination)
\(C \geq 0.9\) indicates a very strong model (excellent discrimination)
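We can check the C and Somers’ d relationship with the values reported in the mod_cc output (C = 0.682, Dxy = 0.363):

```r
## From the lrm output above: C = 0.682 and Dxy = 0.363
C_check <- 0.5 + 0.363 / 2   ## 0.6815, which rounds to the reported C = 0.682
d_check <- 2 * (0.682 - 0.5) ## 0.364, close to the reported Dxy (rounding)
```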
We’ll add a restricted cubic spline with 5 knots in sbp
and an interaction between the educ factor and the linear effect of sbp,
and a quadratic polynomial in glucose
to our main effects model, just to show how to do them…
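One plausible way to write that expanded model in rms syntax (a sketch, assuming the fram_cc data and datadist setup from above): rcs() builds a restricted cubic spline, %ia% a restricted interaction, and pol() a polynomial.

```r
## Sketch: spline in sbp, educ-by-sbp (linear) interaction, quadratic glucose
mod_cc2 <- lrm(chd10 ~ rcs(sbp, 5) + educ + educ %ia% sbp +
                 pol(glucose, 2) + smoker,
               data = fram_cc, x = TRUE, y = TRUE)
```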
I’ll just show the results for the multiple imputation approach, since if you can fit those models, you should have little difficulty applying the same ideas to the single imputation or the complete case analysis.